Abstract

The Stochastic Block Model (Holland et al., 1983) is a mixture model
for heterogeneous network data. Unlike the usual statistical framework,
new nodes give additional information about the previous ones in this
model. Thereby the distribution of the degrees concentrates in points
conditionally on the node class. We show under a mild assumption that
classification, estimation and model selection can actually be achieved
with no more than the empirical degree data. We provide an algorithm
able to process very large networks and consistent estimators based on it.
In particular, we prove a bound of the probability of misclassification of
at least one node, including when the number of classes grows.



1 Introduction 



Strong attention has recently been paid to network models in many domains such 
as social sciences, biology or computer science. Networks are used to represent 
pairwise interactions between entities. For example, sociologists are interested 
in observing friendships, calls and collaboration between people, companies or 
countries. Genomicists wonder which gene regulates which other. But the 
most famous examples are undoubtedly the Internet, where data traffic involves 
millions of routers or computers, and the World Wide Web, containing millions 
of pages connected by hyperlinks. Many other examples of real-world networks
are treated empirically in Albert and Barabasi (2002), and the book by Faust and
Wasserman (1994) gives a general introduction to mathematical modelling of
networks, and especially to graph theory.

One of the main features expected from graph models is inhomogeneity. 
Some articles, e.g. Bollobas et al. (2007) or Van Der Hofstad (2009), address
this question. In the Erdos-Renyi model introduced by Erdos and Renyi (1959)
and Gilbert (1959), all nodes play the same role, while most real-world networks
are definitely not homogeneous.






In this paper, we are interested in the Stochastic Blockmodel (SBM), introduced
by Holland et al. (1983) and inspired by Holland and Leinhardt (1981) and
Fienberg and Wasserman (1981). This model assumes discrete inhomogeneity
in the underlying social structure of the observed population: n nodes are split
into Q homogeneous classes, called blocks, or more generally clusters. Then it is
assumed that the distribution of the edge between two nodes depends only on
the blocks to which they belong. Thereby, within each class, all nodes have the
same connection behavior: they are said to be structurally equivalent (Lorrain
and White, 1971). When the class assignment is known, the social structure
can possibly be visualized through the meta-graph (Picard et al., 2009), which
emphasizes the role of each class. However, the block structure is assumed to
be unobserved, i.e. latent. Thus the assignment Z and the model parameters
must be estimated a posteriori through the observed graph X, which is a real
challenge, especially in large networks.

Our main purpose in this paper is to present a consistent inference method
under SBM, which can above all process very large graphs. Snijders and
Nowicki (1997) have proposed a maximum likelihood estimate based on the EM
algorithm for very small graphs with Q = 2 blocks. They have also proposed
a Bayesian approach based on Gibbs sampling for larger graphs (hundreds of
nodes), which they have extended to arbitrary block numbers in Nowicki and
Snijders (2001). However, the usual techniques enable the processing of only
relatively small graphs, because they suffer severely from graph intricacy. In
particular the EM algorithm deals with the conditional distribution of the
labels Z given the observations X, whose dependency graph is actually a clique in
the case of SBM (see paragraph 5.1 in Daudin et al. (2008)). Inspired by
Wainwright and Jordan (2008), Daudin et al. (2008) have developed approximate
methods using variational techniques in the context of SBM. From a physical
point of view, the variational paradigm amounts to mean-field approximation,
see Jaakkola (2000). Thus thousands of nodes can be processed with this
variational EM algorithm. Lastly, Celisse et al. (2011) prove the variational method
to be consistent precisely under SBM.

All previous methods treat both classification and parameter estimation
directly and at the same time. They are alternately updated at each step of
EM-based algorithms. Yet those tasks are actually not symmetrical, and moreover
estimators are quite simple when Z is known. The classification, which has
remained the main pitfall thus far, can be completed first, and then the latent
assignment Z just replaced with this classification by plug-in in order to
estimate the parameters.

Searching for clusters from a graph is computationally difficult and has
different meanings. Many algorithms, especially coming from physics and computer
science, aim at detecting highly connected clusters, which are self-defined as
optimizing some objective function, see Lancichinetti et al. (2009) and Girvan
and Newman (2002). In contrast, the blocks under SBM have a model-based
definition and do not necessarily have many inner connections (see examples in
Daudin et al. (2008)). Therefore, most algorithms designed for community
detection are generally not suitable in this context.



Bickel and Chen (2009), Choi et al. (2010), Celisse et al. (2011) and Rohe
et al. (2010) prove that it is asymptotically possible to uncover the latent
structure Z of the graph. In this work, we additionally show under a mild assumption
that it is possible to do so just by utilizing degree data instead of the whole
graph X. As a consequence, we can work with $n$ variables instead of $n^2$, which
makes classification computations much faster. The basic reason why so little
information is needed, compared with other models with latent structure,
is specific to SBM. The number of observed variables $(X_{ij})_{1 \le i,j \le n}$ grows faster
than the number of latent variables Z, therefore even marginal distributions
of X concentrate very fast. Our algorithm actually expands the procedure
introduced by Snijders and Nowicki (1997) when Q = 2. Like Bickel and Chen
(2009), we provide probabilistic bounds for the occurrence of at least one error.
Moreover we take the random assignment into account, even when the number
of classes Q increases and the average degree vanishes. Related results are given
in Choi et al. (2010) and Rohe et al. (2010). Nevertheless their bounds concern
the rate of misclassified nodes instead, and do not prevent the number of errors
from growing to infinity as fast as the square root of n for instance. They also
require the assignment Z to be fixed.

The paper is organized as follows. In Section 2 we begin by presenting the
model we shall study and fixing some notations. Above all a concentration
property of the degree distribution is stated in paragraph 2.2, which will be very
useful in proving the consistency of the method mentioned above. The
classification algorithm (called LG) and the main results are presented in this section
as well. In particular, Theorem 2.2 provides a bound of the error probability
and Proposition 2.2.1 gives some convergence rates when the number of classes
is allowed to grow. The consistency proof of the LG algorithm is provided in
Section 3. Section 4 is devoted to deriving simple estimators of the parameters
by plug-in and their consistency is also demonstrated. A simulation study in
Section 5 illustrates the behavior of the LG algorithm, which is discussed
afterwards. In Section 6 the model and the algorithm are more accurately studied.
As an application, it is lastly proved that it is likewise possible to find out
asymptotically the right number Q of blocks of the model. That completes the
method relying just on degrees.



2 The Stochastic Block Model 
2.1 Model 

We first recall the SBM model. For all integers $n > 1$, $[n]$ denotes the set
$\{1, \dots, n\}$. The undirected binary graphs with n nodes are defined by the pair
$([n], X)$ where X is a symmetric binary square matrix of size n. X is called the
adjacency matrix of the graph. Let $Q > 1$ be the number of blocks.

• $Z = (Z_i)_{i \in [n]}$ denotes the latent vector of $[Q]^n$ such that $Z_i = q$ if the node
i is q-labeled. Let $\alpha = (\alpha_1, \dots, \alpha_Q)$ be the vector of the block proportions
in the whole population.

$Z = (Z_i)_{i \in [n]}$ i.i.d. $\sim \mathcal{M}(1; \alpha)$

• Conditionally on the labels Z, the variables $\{X_{ij}, i, j \in [n]\}$ are
independent Bernoulli variables. Conditionally on $\{Z_i = q, Z_j = r\}$, the
parameter of $X_{ij}$ is $\pi_{qr}$.

$X_{ij} \mid (Z_i = q, Z_j = r) \sim \mathcal{B}(\pi_{qr})$

$\pi_{qr}$ is the connection probability between any q-labeled node and any r-labeled
node. Denoting by $\pi = (\pi_{qr})_{q,r \in [Q]}$ the connection matrix, the parameters of the
model are $(\alpha, \pi)$. This model will be denoted by $\mathcal{G}(n, \alpha, \pi)$. Note that in the
sequel n will often be removed from the notations for the sake of simplicity.

This is a classical problem in mixture models: the block labeling is naturally
not identifiable. The content of the blocks remains unchanged by permuting
labels. But equivalence classes are identifiable as soon as $n > 2Q$, see Celisse
et al. (2011).
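The generative description above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sample_sbm`, the 0-based labels and the `seed` argument are choices made here, not part of the paper, and the quadratic double loop is meant for small graphs only.

```python
import random


def sample_sbm(n, alpha, pi, seed=None):
    """Draw labels Z and a symmetric adjacency matrix X from an SBM.

    alpha: list of Q block proportions (assumed to sum to 1).
    pi:    Q x Q symmetric list of lists of connection probabilities.
    Labels are 0-based here, whereas the text uses 1..Q.
    """
    rng = random.Random(seed)
    Q = len(alpha)
    # Z_i i.i.d. multinomial M(1; alpha)
    Z = rng.choices(range(Q), weights=alpha, k=n)
    X = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # X_ij | Z_i = q, Z_j = r  ~  Bernoulli(pi[q][r])
            if rng.random() < pi[Z[i]][Z[j]]:
                X[i][j] = X[j][i] = 1
    return Z, X
```

The diagonal is left at zero (no self-loops), which is an assumption of this sketch; only the pairs $i < j$ are drawn, and symmetry is enforced by construction.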



2.2 Degree distribution 

For all $i \in [n]$, let $D_i^n = \sum_{j \neq i} X_{ij}$ be the degree of the node i, that is the number
of neighbors of this node.

Proposition 2.0.1. For all $q \in [Q]$, let $\bar\pi_q = \sum_{r \in [Q]} \alpha_r \pi_{qr}$. $D_i^n$ is a binomial
distributed random variable conditionally on $Z_i = q$ with parameters $(n - 1, \bar\pi_q)$.

$(D_i^n)_{i \in [n]}$ is therefore a sample of a mixture of binomial distributed random
variables with parameters $(n - 1, \bar\pi_q)_{q \in [Q]}$ and proportions $(\alpha_q)_{q \in [Q]}$.

These variables are correlated. Thus we are not in the validity range of the 
usual algorithms for mixtures like EM. But there is only one edge shared by any 
pair of nodes and the degrees are consequently not heavily correlated. Using the 
EM algorithm would make sense for practical purposes. Nevertheless we have 
chosen to use a faster one-step algorithm, unlike EM which is iterative. 
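As a small numerical illustration of Proposition 2.0.1, the conditional mean connectivities $\bar\pi_q = \sum_r \alpha_r \pi_{qr}$ can be computed directly. The sketch below uses the parameter values of the simulation study in Section 5.1; the function name `mean_connectivities` is hypothetical.

```python
def mean_connectivities(alpha, pi):
    """Return [pi_bar_q] with pi_bar_q = sum_r alpha_r * pi[q][r]."""
    return [sum(a * p for a, p in zip(alpha, row)) for row in pi]


# Parameters taken from the simulation design of Section 5.1.
alpha = [0.3, 0.6, 0.1]
pi = [[0.95, 0.4, 0.4],
      [0.4, 0.7, 0.75],
      [0.4, 0.75, 0.65]]
pi_bar = mean_connectivities(alpha, pi)
```

Up to floating-point rounding, `pi_bar` agrees with the values (0.565, 0.615, 0.635) quoted in Section 5.1.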



A concentration inequality for binomial random variables 

The following inequality will be useful throughout the article. This will
especially account for the fast concentration of the degree distribution. It is a
straightforward consequence of Hoeffding's inequality for bounded variables.

Theorem 2.1. (Hoeffding) Let $n > 1$, $p \in ]0,1[$ and $(Y_i)_{i \in [n]}$ a sequence of
independent identically distributed Bernoulli random variables with parameter
p. Let $S_n = \sum_{i=1}^{n} Y_i$. Then for all $t > 0$:

$P\left(\left|\frac{S_n}{n} - p\right| > t\right) \le 2e^{-2nt^2}$ (CCT)
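To get a feeling for the orders of magnitude involved, the right-hand side of (CCT) can be evaluated and inverted numerically. The helper names below are hypothetical; the second function simply solves $2e^{-2nt^2} \le \varepsilon$ for n.

```python
import math


def hoeffding_bound(n, t):
    """Right-hand side of (CCT): 2 * exp(-2 * n * t^2)."""
    return 2.0 * math.exp(-2.0 * n * t * t)


def sample_size_for(t, eps):
    """Smallest integer n with 2*exp(-2*n*t^2) <= eps, i.e. n >= ln(2/eps) / (2 t^2)."""
    return math.ceil(math.log(2.0 / eps) / (2.0 * t * t))
```

For a deviation $t = 0.1$ and a target probability of 0.05, `sample_size_for(0.1, 0.05)` returns 185. Since the required n scales like $1/t^2$, small deviations t require very large samples.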




Concentration property of the normalized degrees 

Define the normalized degree of node $i \in [n]$:

$T_i^n = \frac{D_i^n}{n - 1}$

$(T_i^n)_{i \in [n]}$ cluster around their average conditionally on the node class when n is
increasing, according to (CCT):

$P(|T_i^n - \bar\pi_q| > t \mid Z_i = q) \le 2e^{-2nt^2}$ (1)

Hence normalized degrees corresponding to q-labeled nodes gather around
$\bar\pi_q$. Consequently, in the degree distribution, nodes from different classes split up
into groups centered around $\bar\pi_q$, provided that all conditional averages $(\bar\pi_q)_{q \in [Q]}$
are different. From now on, we will assume that they are:

Assumption

$\forall q, r \in [Q], \quad q \neq r \Rightarrow \bar\pi_q \neq \bar\pi_r$ (A)

Note that, if it is known that two classes have the same conditional average,
it is possible to resort to the concentration of another marginal distribution: the
distribution of the number of common neighbors for each pair of nodes. Refer
to Appendix B.



2.3 Largest Gaps Algorithm 

Because of the concentration, a larger gap is expected between normalized
degrees of nodes from different classes than nodes from the same class. The
following algorithm relies on this remark. It consists in building Q blocks by finding
the $Q - 1$ largest intervals formed by two consecutive normalized degrees.

If $(u_i)_{i \in [n]}$ is a sequence of real numbers, $(u_{(i)})_{i \in [n]}$ denotes the same
sequence but sorted in increasing order.

Algorithm

• Sort the sequence of the normalized degrees in increasing order:

$T_{(1)} \le \dots \le T_{(n)}$

• Calculate every gap between consecutive normalized degrees:

$T_{(i+1)} - T_{(i)}$ for all $i \in [n - 1]$

• Find the indexes of the $Q - 1$ largest gaps: $i_1 < \dots < i_{Q-1}$, such that for
all $k \in [Q - 1]$ and for all $i \in [n - 1] \setminus \{i_1, \dots, i_{Q-1}\}$:

$T_{(i_k + 1)} - T_{(i_k)} \ge T_{(i+1)} - T_{(i)}$

• Noting $(i_0) = 0$ and $(i_Q) = n$, associate with each index $(i)$ a class number:

k such that $(i_{k-1}) < (i) \le (i_k)$.
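The four steps above can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation: the function names are hypothetical, ties between equal gaps are broken by position (the text does not specify a rule), and the adjacency matrix is assumed to be a list of 0/1 rows with a zero diagonal.

```python
def normalized_degrees(X):
    """T_i = D_i / (n - 1), where D_i is the row sum of the adjacency matrix X."""
    n = len(X)
    return [sum(row) / (n - 1) for row in X]


def largest_gaps(T, Q):
    """Return a class number in 1..Q for every entry of T (LG algorithm sketch)."""
    n = len(T)
    # Step 1: node indices sorted by increasing normalized degree.
    order = sorted(range(n), key=lambda i: T[i])
    # Step 2: gaps between consecutive sorted values.
    gaps = [T[order[k + 1]] - T[order[k]] for k in range(n - 1)]
    # Step 3: 0-based positions of the Q - 1 largest gaps.
    cuts = set(sorted(range(n - 1), key=lambda k: gaps[k], reverse=True)[:Q - 1])
    # Step 4: walk through the sorted nodes, moving to the next class after each cut.
    labels = [0] * n
    current = 1
    for rank, node in enumerate(order):
        labels[node] = current
        if rank in cuts:
            current += 1
    return labels
```

For instance, `largest_gaps([0.52, 0.10, 0.90, 0.12, 0.50, 0.54], 3)` returns `[2, 1, 3, 1, 2, 2]`: class numbers follow the increasing order of the normalized degrees, so the output is a partition up to the usual label switching.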



Example 

On the figure below, the largest gaps correspond to the intervals $[T_{(2)}, T_{(3)}[$
and $[T_{(9)}, T_{(10)}[$, each marked with a circled symbol. Nodes (1) and (2) are therefore
classified in class 1, nodes from (3) to (9) in 2, nodes (10) and (11) in 3.

Figure 1: Distribution of the normalized degrees $T_{(1)}, \dots, T_{(11)}$ along an axis,
with one marker type for each of Class 1, Class 2 and Class 3.



This algorithm has all the qualities mentioned in the Introduction and makes
good use of the concentration, which makes the consistency easy to prove.
Whereas variational EM algorithms run as many quadratic steps as needed
to reach convergence and classical spectral clustering runs in cubic time, this
algorithm is especially fast. Indeed the sorting runs in quasilinear time and,
although the computation of the degrees is quadratic, this is a very basic
operation which is very quickly performed. Note that Condon and Karp (2001)
gave an algorithm running in linear time and consistent under SBM (called the
planted $\ell$-partition model in their paper), but provided that the weights of the
blocks are equal.

Nevertheless this algorithm seems to be relatively naive because it takes
every normalized degree into account and each one carries the same weight,
even if it is isolated and not statistically representative. In the worst case, a
single point is sufficient to trick the algorithm and make the classification wrong
for a majority of nodes, especially at low graph sizes.
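The worst case described above is easy to reproduce on a toy example. The values below are hypothetical normalized degrees chosen for illustration, and `largest_gaps` is the same sketch of the LG algorithm as before, repeated so that the block is self-contained.

```python
def largest_gaps(T, Q):
    """Label entries of T with 1..Q by cutting at the Q - 1 largest gaps."""
    order = sorted(range(len(T)), key=lambda i: T[i])
    gaps = [T[order[k + 1]] - T[order[k]] for k in range(len(T) - 1)]
    cuts = set(sorted(range(len(gaps)), key=lambda k: gaps[k], reverse=True)[:Q - 1])
    labels, current = [0] * len(T), 1
    for rank, node in enumerate(order):
        labels[node] = current
        if rank in cuts:
            current += 1
    return labels


# Hypothetical data: five nodes near 0.3, five near 0.5, and one isolated
# node at 0.95 whose true class is assumed to be the second one.
T = [0.30, 0.31, 0.32, 0.33, 0.34, 0.50, 0.51, 0.52, 0.53, 0.54, 0.95]
labels = largest_gaps(T, Q=2)
# The single largest gap (0.54 to 0.95) isolates the outlier, so the two
# genuine groups are merged into one estimated class.
```

Here the ten regular nodes all receive class 1 and the isolated node alone receives class 2, even though the gap between the two real groups (0.16) is clearly visible.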

2.4 Main results 

The true (respectively estimated) partition of $[n]$ in classes is denoted by the set
$\{C_q\}_{q \in [Q]}$ (resp. by $\{\hat C_q\}_{q \in [Q]}$) and the cardinality of the true q-labeled class
by $N_q$ (resp. by $\hat N_q$). We expect the estimated partition to be almost surely
the true partition when n is large enough. Define $E_n$ as the event "The LG
algorithm makes at least one mistake", that is:

$E_n = \{\{\hat C_q\}_{q \in [Q]} \neq \{C_q\}_{q \in [Q]}\}$

Definition 1. $\{\hat C_q\}_{q \in [Q]}$ is said to be consistent if

$P_{\alpha,\pi}(\bar E_n) \to 1$ as $n \to +\infty$

Definition 2. Define $\delta$, the characteristic minimal gap (or separability) of the
model, in the following way:

$\delta = \min_{q \neq r} |\bar\pi_q - \bar\pi_r|$

Finally, let us define $\alpha_0$, the smallest proportion of the model. The
classification is harder for small values of $\alpha_0$:

$\alpha_0 = \min_{q \in [Q]} \alpha_q$

Theorem 2.2. Under Assumption (A),

$P_{\alpha,\pi}(E_n) \le 2n e^{-\frac{1}{8} n \delta^2} + Q(1 - \alpha_0)^n$

Section 3 contains the proof of this theorem. The most important parameter
is $\delta$: the smaller it is, the harder the separation between the classes is, and so
the larger n must be to retrieve the true partition.
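The bound of Theorem 2.2 is straightforward to evaluate numerically, which helps to see how strongly the required graph size depends on $\delta$. The sketch below assumes the bound in the form $2n e^{-n\delta^2/8} + Q(1-\alpha_0)^n$ and uses the hypothetical function name `lg_error_bound`; the example values $\delta = 0.02$, $\alpha_0 = 0.1$ and $Q = 3$ are those of the simulation study in Section 5.1.

```python
import math


def lg_error_bound(n, delta, alpha0, Q):
    """Upper bound on P(E_n): 2 n exp(-n delta^2 / 8) + Q (1 - alpha0)^n."""
    spreading_term = 2.0 * n * math.exp(-n * delta ** 2 / 8.0)
    empty_class_term = Q * (1.0 - alpha0) ** n
    return spreading_term + empty_class_term


# With the parameters of Section 5.1, the bound is uninformative for
# moderate n and only becomes small for several hundred thousand nodes.
bound_small_graph = lg_error_bound(10_000, 0.02, 0.1, 3)
bound_large_graph = lg_error_bound(330_000, 0.02, 0.1, 3)
```

With these values the first term dominates: the empty-class term is negligible as soon as n is a few hundred, whereas the spreading term requires $n\delta^2/8$ to exceed $\ln(2n)$ by a comfortable margin.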

Convergence rates 

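In order to derive orders of magnitude of n to achieve convergence in Theorem
2.2, we choose another asymptotic framework only in this paragraph, where
the parameters are functions of n. Consistency does not mean convergence
under the distribution of $\mathcal{G}(n, \alpha, \pi)$ anymore, but under $\mathcal{G}(n, \alpha^n, \pi^n)$, with
$\alpha^n = (\alpha_1^n, \dots, \alpha_{Q_n}^n)$ and $\pi^n = (\pi_{qr}^n)_{1 \le q,r \le Q_n}$. We assume that:

$\delta_n \to 0$, $\alpha_0^n \to 0$ and $Q_n \to +\infty$ as $n \to +\infty$.

Proposition 2.2.1. The inference method with LG algorithm is still consistent
under the following assumptions:

(a) $\lim_{n \to +\infty} \delta_n \sqrt{\frac{n}{\ln n}} > 2\sqrt{2}$

(b) $Q_n = O\left(\sqrt{\frac{n}{\ln n}}\right)$

(c) $\lim_{n \to +\infty} \frac{-n \ln(1 - \alpha_0^n)}{\ln Q_n} > 1$;
for example, if $Q_n = \sqrt{\frac{n}{\ln n}}$, it is sufficient that $\alpha_0^n > \frac{\ln n}{2n}$.

Proof. Assumption (a) implies that there exists $C > 2\sqrt{2}$ such that for n large
enough:

$\delta_n \sqrt{\frac{n}{\ln n}} > C$ and then $\frac{n \delta_n^2}{\ln n} - 8 > C^2 - 8 > 0$

Therefore

$n \exp\left(-\frac{n \delta_n^2}{8}\right) = \exp\left(-\frac{\ln n}{8}\left(\frac{n \delta_n^2}{\ln n} - 8\right)\right) < \exp\left(-\frac{1}{8} \ln n \, (C^2 - 8)\right) \to 0$ as $n \to +\infty$.

Secondly (A) requires $(Q_n - 1)\delta_n \le 1$ as a necessary condition. Hence, applying
the first inequality:

$Q_n \le 1 + \frac{1}{\delta_n} = O\left(\sqrt{\frac{n}{\ln n}}\right)$

According to Assumption (c), there exists $C' > 1$ such that for n large
enough:

$\frac{-n \ln(1 - \alpha_0^n)}{\ln Q_n} > C'$, so that:

$Q_n (1 - \alpha_0^n)^n = \exp[\ln Q_n + n \ln(1 - \alpha_0^n)] = \exp\left(-\ln Q_n \left(\frac{-n \ln(1 - \alpha_0^n)}{\ln Q_n} - 1\right)\right) < \exp(-\ln Q_n \, (C' - 1)) \to 0$ as $n \to +\infty$.

□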

Large graphs are more and more sparse as n increases, which results in the
decrease in the connectivity defined by $\bar\pi_n = E_{\alpha^n,\pi^n}(T_1^n)$.

Proposition 2.2.2. The LG algorithm is still consistent in the following cases:

• while $\left(\frac{\ln n}{n}\right)^{3/2} = O(\bar\pi_n)$, if $Q_n$ is bounded.

• while $\sqrt{\frac{\ln n}{n}} = O(\bar\pi_n)$, if $Q_n \sim \sqrt{\frac{n}{\ln n}}$.

Proof. We sketch the proof with the following inequality, which estimates the 
connectivity of the sparsest model: 

q=l q=l 

□ 

3 Consistency proof of the LG algorithm 
3.1 An ideal event for the algorithm 

The LG algorithm delivers the true partition especially when none of the classes
is empty, and the spreading of the normalized degrees is small compared with
the minimal gap $\delta$. $A_n$ denotes the event "No true class is empty", that is

$A_n = \bigcap_{q \in [Q]} \{C_q \neq \emptyset\} = \bigcap_{q \in [Q]} \{N_q \neq 0\}$

Definition 3. We call maximal intraclass distance (or spreading) the random
variable $d_n$ defined by:

$d_n = \max_{q \in [Q]} \sup_{i \in C_q} |T_i^n - \bar\pi_q|$

This is the maximal distance between the normalized degree of a node and
its own conditional mean, over all nodes and all classes. This is basically a
measurement of the within-class spreading of the normalized degrees.
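On simulated data, where the true labels and the true means are known, the spreading can be computed directly. The sketch below uses hypothetical function names and 0-based labels; `ideal_event_holds` checks the sufficient condition "no true class is empty and $d_n < \delta/4$", which is the limiting form of the event used in this section.

```python
def spreading(T, Z, pi_bar):
    """d_n: largest distance between a normalized degree and its class mean.

    T: normalized degrees, Z: true labels (0-based), pi_bar: class means.
    """
    return max(abs(t - pi_bar[z]) for t, z in zip(T, Z))


def ideal_event_holds(T, Z, pi_bar, delta):
    """True when every class is non-empty and the spreading is below delta / 4."""
    no_empty_class = set(Z) == set(range(len(pi_bar)))
    return no_empty_class and spreading(T, Z, pi_bar) < delta / 4.0
```

This quantity is only available in simulations, since it depends on the latent labels Z and on the unknown means $\bar\pi_q$.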

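Proposition 3.0.3. Under Assumption (A), the following inclusion holds for
all $\varepsilon > 0$:

$A_n \cap \left\{d_n < \frac{\delta}{4 + \varepsilon}\right\} \subset \bar E_n$

Proof. Suppose that $A_n \cap \{d_n < \frac{\delta}{4 + \varepsilon}\}$ is true. For all $i, j \in [n]$ and $q, r \in [Q]$:

• If nodes i and j have label q, then:

$|T_i - T_j| \le |T_i - \bar\pi_q| + |T_j - \bar\pi_q| < \frac{2\delta}{4 + \varepsilon}$

• Inversely, if they have different labels, respectively q and r, then:

$|T_i - T_j| \ge |T_j - \bar\pi_q| - |T_i - \bar\pi_q|$
$\ge |\bar\pi_r - \bar\pi_q| - |T_j - \bar\pi_r| - |T_i - \bar\pi_q|$
$> \delta - \frac{\delta}{4 + \varepsilon} - \frac{\delta}{4 + \varepsilon} = \frac{2 + \varepsilon}{4 + \varepsilon}\,\delta > \frac{2\delta}{4 + \varepsilon}$

As a conclusion of this alternative, i and j are in the same class if and only
if $|T_i - T_j| < \frac{2\delta}{4 + \varepsilon}$. Notice moreover that there exist exactly $Q - 1$ intervals
between consecutive normalized degrees, among the set $(]T_{(i)}, T_{(i+1)}[)_{i \in [n-1]}$, whose length is strictly greater than $\frac{2\delta}{4 + \varepsilon}$ on this event. Hence the
$Q - 1$ largest intervals lie between groups of normalized degrees from different
classes, whereas all others lie between degrees of the same class. In this case
the algorithm returns the true partition.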

□ 

3.2 Bound of the probability of large spreading 

In this paragraph we shall show that the dispersion $d_n$ converges to 0 thanks to
the subgaussian tail of the binomial distributions. This is a basic result of this
article, because all others require controlling the dispersion.

Proposition 3.0.4. For all $t > 0$:

$P(d_n > t) \le 2n e^{-2nt^2}$

Proof. It consists in conditioning by the class of each node, in order to apply the
concentration inequality (CCT), and of a union bound. Since $D_i \mid Z_i = q \sim \mathcal{B}(n, \bar\pi_q)$,
(CCT) gave the inequality (1):

$P(|T_i - \bar\pi_q| > t \mid Z_i = q) \le 2e^{-2nt^2}$

Hence:

$P(d_n > t) = E(P(d_n > t \mid Z))$

$= E\left(P\left(\bigcup_{q \in [Q]} \bigcup_{i \in C_q} \{|T_i - \bar\pi_q| > t\} \,\Big|\, Z\right)\right)$

$\le E\left(\sum_{q \in [Q]} \sum_{i \in C_q} P(|T_i - \bar\pi_q| > t \mid Z_i = q)\right) \le 2n e^{-2nt^2}$

□

Remark. Furthermore $d_n$ almost surely converges to 0 because the upper
bound is summable, by applying a usual consequence of the Borel-Cantelli
lemma.



3.3 Bound of the error probability (proof of Theorem 2.2) 

Thanks to the bound of the probability of large spreading, one can easily
conclude that the ideal event $A_n \cap \{d_n < \frac{\delta}{4 + \varepsilon}\}$ is actually strongly likely for n large
enough and for all $\varepsilon > 0$:

Proof. First we have $A_n \cap \{d_n < \frac{\delta}{4 + \varepsilon}\} \subset \bar E_n$ according to Proposition 3.0.3,
hence:

$P(E_n) \le P\left(\overline{A_n \cap \left\{d_n < \frac{\delta}{4 + \varepsilon}\right\}}\right) \le P\left(d_n \ge \frac{\delta}{4 + \varepsilon}\right) + P(\bar A_n)$

On the one hand, Proposition 3.0.4 implies that:

$P\left(d_n \ge \frac{\delta}{4 + \varepsilon}\right) \le 2n \exp\left(-2n \left(\frac{\delta}{4 + \varepsilon}\right)^2\right)$

On the other hand $\bar A_n$ corresponds to "There exists an empty class". For all
$q \in [Q]$, $N_q \sim \mathcal{B}(n, \alpha_q)$, hence:

$P(\bar A_n) = P\left(\bigcup_{q \in [Q]} \{N_q = 0\}\right) \le \sum_{q \in [Q]} P(N_q = 0) = \sum_{q \in [Q]} (1 - \alpha_q)^n \le Q(1 - \alpha_0)^n$

Once both previous inequalities have been put together, we have an upper
bound of $P(E_n)$ which depends on $\varepsilon$. The limit of the upper bound when $\varepsilon$
tends to zero yields the bound of the Theorem. □



4 Consistency of the plug-in estimators 

If the true classes were known, the usual moment estimators would be enough to 
estimate (a, tt). Indeed the empirical proportions estimate a and the connection 
frequencies estimate the connection probabilities. We first prove that if we knew 
the classes, we would obtain a consistent estimate. However those variables are 
not observed but latent. That is why we plug the partition delivered by any 
consistent classification algorithm into these estimators. Notice that it does not 
depend on the choice of the consistent algorithm. 

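Notations. For all q, r in $[Q]$, $C_{qr}$ denotes $C_q \times C_r$, and $N_{qr}$ its cardinality. If
$q \neq r$, $N_{qr} = N_q N_r$ and if $q = r$, $N_{qq} = \frac{N_q(N_q - 1)}{2}$. We define the following
estimators:

$\tilde\alpha_q = \frac{N_q}{n}$ and $\tilde\pi_{qr} = \frac{1}{N_{qr}} \sum_{(i,j) \in C_{qr}} X_{ij}$

Recall that all of these variables are hidden thus far.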



4.1 Estimation with revealed classes 

Theorem 4.1. $(\tilde\alpha, \tilde\pi)$ is a consistent estimator of $(\alpha, \pi)$.

Proof. For all $q \in [Q]$, $N_q$ is the sum of n independent Bernoulli random
variables with parameter $\alpha_q$. Applying directly the concentration inequality, we get
for all $t > 0$ and $q \in [Q]$: $P(|\tilde\alpha_q - \alpha_q| > t) \le 2e^{-2nt^2}$. Applying the
concentration inequality (CCT) conditionally on $N_{qr}$ and then taking the expectation,
we get for all $t > 0$:

$P(|\tilde\pi_{qr} - \pi_{qr}| > t) = E[P(|\tilde\pi_{qr} - \pi_{qr}| > t \mid N_{qr})] \le 2E\left(e^{-2N_{qr}t^2}\right)$

Define:

$\alpha_{qr} = \alpha_q \alpha_r$ if $q \neq r$ and $\alpha_{qq} = \frac{\alpha_q^2}{2}$ if $q = r$.

Let $(r_n)$ be a non-negative sequence tending to infinity. We split up the support
of the expectation into two pieces, depending on the values of $N_{qr}$. On the one
hand the exponential term inside the expectation is bounded on the first piece
of the support by a deterministic sequence. On the other hand, the probability
of the second piece of the support $\{|N_{qr} - \alpha_{qr} n^2| > r_n\}$ is
accurately controlled by using the concentration inequality derived from (CCT)
in Appendix A.

$E[\exp(-2N_{qr}t^2)] = E\left[\exp(-2N_{qr}t^2)\,1_{\{|N_{qr} - \alpha_{qr}n^2| \le r_n\}} + \exp(-2N_{qr}t^2)\,1_{\{|N_{qr} - \alpha_{qr}n^2| > r_n\}}\right]$

$\le E[\exp(-2t^2(\alpha_{qr}n^2 - r_n))] + P(|N_{qr} - \alpha_{qr}n^2| > r_n)$



< exp{-2t^{aqrn^ - r„)) + P 



< exp 



4 exp 



Oiqr 

1 

_ 

~2n^ 



> 



(B) 



In order to have a vanishing bound (IB]), we just have to choose (r„) such 



that: 



lim ^>iand^ 



= n^/'*, hence: 



Tl'^ n— f+oo 



^ +00 



For example, r, 

E [exp{-2Nqrt^)] < exp -n'^/H^ (n^^^al - + 4exp 
Then we conclude with a union bound: 

Finally we conclude for all parameters: 
P{\\{^,a) - (a,7r)||oo > t) < 2Q^ (^e-"^/^*^("^/^-^l) + ^^-"^^'^ + 2Q 



□ 



4.2 Estimation with hidden classes 

We now assume that we have got a partition of the nodes $\{\hat C_q\}_q$ returned by any
classification algorithm. The estimators $\hat\alpha$ and $\hat\pi$ are defined by plug-in with the
estimated partition $\{\hat C_q\}_q$ instead of the true one $\{C_q\}_q$. If the classification is
right, then estimators both with hat and with tilde are equal.

$\hat\alpha_q = \frac{\hat N_q}{n}$ and $\hat\pi_{qr} = \frac{1}{\hat N_{qr}} \sum_{(i,j) \in \hat C_{qr}} X_{ij}$

Theorem 4.2. If $\{\hat C_q\}_q$ is consistent, then $(\hat\alpha, \hat\pi)$ is a consistent estimator of
$(\alpha, \pi)$.

Proof. For all $t > 0$, let $B_t^n = \{\|(\hat\alpha, \hat\pi) - (\alpha, \pi)\| > t\}$.

$\forall t > 0, \quad P(B_t^n) = P(B_t^n \cap \bar E_n) + P(B_t^n \cap E_n) \le P(B_t^n \cap \bar E_n) + P(E_n)$

On the event $\bar E_n$, the equality $(\hat\alpha, \hat\pi) = (\tilde\alpha, \tilde\pi)$ holds, hence:

$\forall t > 0, \quad P(B_t^n) \le P(\|(\tilde\alpha, \tilde\pi) - (\alpha, \pi)\| > t) + P(E_n)$

The first term converges to 0 according to Theorem 4.1 and the second one as
well, provided the algorithm is consistent (see Theorem 2.2). □
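The plug-in estimators can be sketched as follows, for any label vector (true or estimated). The function name `plug_in_estimates` and the 0-based labels are assumptions of this illustration; within a class, pairs are counted once ($i < j$), which matches $N_{qq} = N_q(N_q - 1)/2$, and an undefined frequency (no pair available) is returned as NaN.

```python
def plug_in_estimates(X, labels, Q):
    """Return (alpha_hat, pi_hat) computed from adjacency matrix X and labels in 0..Q-1."""
    n = len(X)
    counts = [0] * Q
    for z in labels:
        counts[z] += 1
    alpha_hat = [c / n for c in counts]

    # edges[q][r]: number of edges between classes q and r (within q when q == r).
    edges = [[0] * Q for _ in range(Q)]
    for i in range(n):
        for j in range(i + 1, n):
            q, r = labels[i], labels[j]
            edges[q][r] += X[i][j]
            if q != r:
                edges[r][q] += X[i][j]

    pi_hat = [[float("nan")] * Q for _ in range(Q)]
    for q in range(Q):
        for r in range(Q):
            pairs = counts[q] * (counts[q] - 1) // 2 if q == r else counts[q] * counts[r]
            if pairs > 0:
                pi_hat[q][r] = edges[q][r] / pairs
    return alpha_hat, pi_hat
```

Feeding this function the true labels gives the estimators with tilde, and feeding it the output of a classification algorithm gives the estimators with hat, up to a permutation of the class numbers.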

4.3 Conclusions 

The previous paragraphs did not depend on the algorithm chosen. Now putting 
together the results of the previous section and the results concerning the LG 

algorithm, we get: 

Theorem 4.3. For all t > 

P{\\{9,a) - (a,7r)||oo X) < 2Q2 (e-"'*'K-""''') + 46^5^) + 2Qe'^"'" 

+ 2ne-*"*' +Q(l~ao)" 

Note that the estimation procedure requires larger graphs to achieve consis- 
tency than does the classification procedure with the LG algorithm alone. This 
is basically due to the variability of the empirical proportions. Since the upper 
bound is summable, a usual consequence of the Borel-Cantelli lemma implies 
the strong consistency of these estimators. 

Discussion. We now consider the asymptotic framework $\mathcal{G}(n, \alpha^n, \pi^n)$, as we
already did in paragraph 2.4. The previous bound is very interesting
when $\lim \alpha_0^n > 0$ and then $\lim Q_n < +\infty$, because it allows strong consistency
for example. If we want just consistency, we can change the bound so that
the convergence rates of $(\alpha_0^n)$ and $(Q_n)$ are closer to optimal in our asymptotic
framework.

Proposition 4.3.1. The inference method with LG algorithm is still consistent
under Assumptions (a), (b) and (d), where

(d) $\lim_{n \to +\infty} \alpha_0^n \left(\frac{n}{\ln n}\right)^{1/4} > \sqrt{2}$



Proof. First of all, we consider the bound ([B| in the proof of Theorem 4.1 and 
this time, we take r„ = \/ Arfi Inn, so that it yields the following bound: 

P(|l(i^,5)-(a",^")||oo >t) < 



2Qi exp 



-2iV7i3 1n7i (^(a^)^y^ 



4 Inn ^ 



2Q„e-2"*" (B') 
Assumption (b) is sufficient to show the convergence of $2Q_n e^{-2nt^2}$.
Assumptions (b) and (d) have to be proved sufficient for the remaining term of
the bound (B'). Assumption (d) implies that there exists $C > \sqrt{2}$ such that for
n large enough:

$\alpha_0^n \left(\frac{n}{\ln n}\right)^{1/4} > C$, hence $(\alpha_0^n)^2 \sqrt{\frac{n}{4 \ln n}} - 1 > \frac{C^2}{2} - 1 > 0$

It is easily deduced from this that the first term of the bound (B') therefore
converges to zero.

Moreover, note that the convergence of this term implies the convergence
of $Q_n(1 - \alpha_0^n)^n$ as well. Recall that Assumption (a) implies the convergence of
$2n e^{-\frac{1}{8} n \delta_n^2}$. As a conclusion, the consistency holds.

□ 



5 Simulation study 

Our main purpose in this study is to figure out how the LG algorithm behaves
in practice, and above all, to check whether the bounds of Theorem 2.2 are
pessimistic or not. The empirical frequency of the graphs with no error would
be of great interest, because that is the quantity the bound concerns. But
actually this frequency has no smooth evolution: it suddenly shifts from 0 to
almost 1. We shall use two types of error rates: a global one and one for each
class, so as to examine more accurately the results given by the algorithm.

5.1 Simulation design 

The parameters used in the simulation are: 

$\alpha = (0.3 \quad 0.6 \quad 0.1)$

$\pi = \begin{pmatrix} 0.95 & 0.4 & 0.4 \\ 0.4 & 0.7 & 0.75 \\ 0.4 & 0.75 & 0.65 \end{pmatrix}$

Hence $\bar\pi = (0.565 \quad 0.615 \quad 0.635)$ and $\delta = 0.02$.

The evolutions of the classification error rates and the estimators with
respect to the number of nodes n are averaged over 1000 graphs drawn from
$\mathcal{G}(n, \alpha, \pi)$ and displayed from 1000 to 60000 nodes.

First of all, the global error rate $g_n$ is defined as the proportion of node pairs
$(i, j)$, either classified in distinct classes whereas their true labels are identical,
or classified together whereas their true labels are different. That is, denoting
$\hat Z$ the label vector returned by the LG algorithm:

$g_n(Z, \hat Z) = \binom{n}{2}^{-1} \sum_{1 \le i < j \le n} \left(1_{Z_i = Z_j} 1_{\hat Z_i \neq \hat Z_j} + 1_{Z_i \neq Z_j} 1_{\hat Z_i = \hat Z_j}\right)$

Secondly, we also propose error rates for each class. Define $I_q$, resp. $M_q$, the
rate of intruders (or false positive rate) in the class q predicted by the algorithm,
resp. the rate of missing nodes of the true class q (or false negative rate).

The algorithm gives labels to the nodes in order of increasing degree. Indeed
the true labels are expected to be sorted this way, because $\bar\pi_1 < \bar\pi_2 < \bar\pi_3$. This
partially solves the label switching problem which arises when trying to identify
the true labels instead of the equivalence classes.
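The global error rate described above, the proportion of node pairs on which the true and estimated partitions disagree, can be computed with a direct double loop. The sketch below assumes normalization by the number of pairs $n(n-1)/2$, uses the hypothetical name `global_error_rate`, and is quadratic in n, so it is meant for moderate graph sizes.

```python
def global_error_rate(Z_true, Z_hat):
    """Proportion of pairs (i, j), i < j, grouped differently by the two labelings."""
    n = len(Z_true)
    errors = 0
    for i in range(n):
        for j in range(i + 1, n):
            same_true = Z_true[i] == Z_true[j]
            same_hat = Z_hat[i] == Z_hat[j]
            # A pair counts as an error when exactly one labeling groups it together.
            if same_true != same_hat:
                errors += 1
    return errors / (n * (n - 1) / 2)
```

Because it only compares pairs, this rate is invariant under any permutation of the class numbers: `global_error_rate([1, 1, 2, 2], [2, 2, 1, 1])` is 0, while `global_error_rate([1, 1, 2, 2], [1, 1, 1, 2])` is 0.5.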

5.2 Results 

Figure 2: Evolution of the average global error rate $g_n$ as a function of the graph size

The evolution is quite satisfactory because the error rate completely vanishes
from n = 45000 nodes, which is even earlier than expected from the bound of
Theorem 2.2. Indeed this bound predicted that the probability of at least one
error would not be less than 0.05 earlier than n = 300000. The bound seems
to be pessimistic, basically because of the union bound used in the proof of
Proposition 3.0.4. After a dramatic decrease at the beginning, the evolution
encounters a slight stagnation between n = 10000 and n = 20000 nodes. An
interpretation of this transitional phase can be given with the error rates for
each class.

Figure 3: Error rates $I_q^n$ and $M_q^n$ (two panels, x-axis: number of nodes, $\times 10^4$)

The first class is much better detected even at low graph sizes, unlike class 2
and class 3. Indeed it is sufficient that the maximal intraclass distance $d_n$ is less
than $(\bar\pi_2 - \bar\pi_1)/4$ to detect this class, whereas the other two are not supposed
to be separated before

$d_n < \frac{\bar\pi_3 - \bar\pi_2}{4} = \frac{\delta}{4} < \frac{\bar\pi_2 - \bar\pi_1}{4}$

according to our previous study. That is the reason why the global error rate
dramatically decreases until reaching n = 10000 nodes, and why it does not
vanish before reaching n = 25000. Note that the bounds of Proposition 3.0.4
and Theorem 2.2 had not predicted this before reaching n = 50000 and n = 264000
respectively.

Figure 4: Estimators (two panels: mean of $\hat\pi$ and standard deviation of $\hat\pi$; x-axis: number of nodes, $\times 10^4$)

In short, as long as the tails of the normalized degree distribution are
overlapping, the classes are mixed and cannot be properly detected. The curves
show in particular that many nodes of class 2 seem to be caught by class 3.
Indeed there are many intruders from class 2 in class 3. The missing nodes of
class 1 are likely caught by class 2. As a consequence, the proportions of classes
1 and 2 are underestimated in the transitional phase, whereas the proportion
of class 3 is overestimated. The inversion of classes 2 and 3 is shown again on
graphic 4.1, as on 3.1.
6 Model selection



Up to this section, the number of classes was supposed to be known and was an 
input parameter of the LG algorithm. Our main purpose hereafter is to examine 
more accurately the sequence of the gaps sorted in increasing order and then 
the sequence of the intervals between the means of the groups given by the LG 
algorithm, depending on the selected number of classes Q for the model. As an 
application of this study, we finally show that degrees are likewise sufficient to 
asymptotically select the right number of classes. 



6.1 Study of the gap sequence 

We will use the same notations as in the last section. Moreover Qo denotes the 
true number of classes, and Q the current input parameter of the LG algorithm. 
We will often use the event i?„ = A„ n < |}, where no class is empty 
and the dispersion c?„ is so small that the Qo ^ 1 largest intervals separate the 



true classes (see Proposition 3.0.3 with e = 1). Then we can affirm that two 
normalized degrees are in the same class if and only if their distance is less than 

Let $(G^n_q)_{q \in [n-1]}$ be the sequence of the distances between consecutive
normalized degrees $(T_{(i+1)} - T_{(i)})_{i \in [n-1]}$, but sorted in decreasing order:

$$G^n_1 \ge G^n_2 \ge \dots \ge G^n_{n-1}$$

The $Q_0 - 1$ largest gaps in the LG algorithm have lengths $G_1, \dots, G_{Q_0-1}$. Define
also $(\gamma_q)_{q \in [Q_0-1]}$ as the sequence $(\bar\pi_{(q+1)} - \bar\pi_{(q)})_{q \in [Q_0-1]}$ sorted in decreasing order.
This is called the sequence of the theoretical gaps. The following theorem states
that the largest empirical gaps converge to the corresponding theoretical gaps, which
reinforces our intuition about the model.
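
As a minimal illustration of the two sequences just defined, the sketch below computes the empirical gaps and the theoretical gaps with the same helper. The function name `decreasing_gaps`, the normalized degrees and the conditional means are hypothetical values chosen for the example, not data from the paper.

```python
def decreasing_gaps(values):
    """Gaps between consecutive sorted values, listed from largest to smallest."""
    ordered = sorted(values)
    gaps = [ordered[i + 1] - ordered[i] for i in range(len(ordered) - 1)]
    return sorted(gaps, reverse=True)


# Hypothetical normalized degrees of 7 nodes, scattered around three means.
normalized_degrees = [0.10, 0.12, 0.11, 0.50, 0.52, 0.90, 0.88]
# Assumed conditional means of the three classes (illustration only).
conditional_means = [0.11, 0.51, 0.89]

empirical = decreasing_gaps(normalized_degrees)    # plays the role of (G_q)
theoretical = decreasing_gaps(conditional_means)   # plays the role of (gamma_q)

# Compare the Q0 - 1 largest empirical gaps with the theoretical ones.
n_true = len(conditional_means)
for g, gamma in zip(empirical[: n_true - 1], theoretical):
    print(f"empirical gap {g:.2f}  vs  theoretical gap {gamma:.2f}")
```

In this toy example the two largest empirical gaps are already close to the theoretical gaps, while the remaining ones are an order of magnitude smaller.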

Theorem 6.1. For all $q < Q_0$, $G_q \to \gamma_q$ a.s. when $n \to +\infty$.

Refer to Appendix C to see the proof. One can easily realize that the only
gap (among the $Q_0 - 1$ largest) lying between $\bar\pi_{(q)}$ and $\bar\pi_{(q+1)}$ converges to
$\bar\pi_{(q+1)} - \bar\pi_{(q)}$. However the index of this interval is random and depends on $n$.
This interesting but technical problem is solved in the second part of the proof.
For the moment we provide a weaker version of this theorem, the proof of which
is much simpler. Its conclusion is sufficient for our purposes.

Theorem 6.2. For all $q < Q_0$, $\liminf_{n \to +\infty} G_q \ge \frac{3}{5}\delta > 0$ a.s.

Proof. If $q < Q_0$: on the event $B_n$, the $Q_0 - 1$ largest intervals necessarily
lie between normalized degrees from different classes. There exist $i \in C_r$ and
$j \in C_s$, where $s \ne r$, such that $G_q = |T_i - T_j|$. But $|T_i - \bar\pi_r| \le d_n$ and
$|T_j - \bar\pi_s| \le d_n$, hence

$$G_q \ge |\bar\pi_r - \bar\pi_s| - 2d_n > \delta - \frac{2}{5}\delta = \frac{3}{5}\delta$$

Namely $B_n \subset \{G_q > \frac{3}{5}\delta\}$, so that $P(G_q \le \frac{3}{5}\delta) \le P(\bar{B}_n)$.
As the upper bound of $P(\bar{B}_n)$ is summable, according to the Borel-Cantelli lemma,
almost surely $B_n$ holds for all $n$ large enough.
Therefore $\liminf_{n \to +\infty} G_q \ge \frac{3}{5}\delta > 0$ almost surely.

□ 

All further gaps lie between degrees of nodes of the same class and then 
converge to zero. The next theorem gives an estimation of the convergence rate. 

Theorem 6.3. For all $\beta \in ]0, 1[$, the triangular array

$$\{n^\beta G^n_q,\ Q_0 \le q \le n - 1\}$$

converges uniformly w.r.t. $q$ and a.s. to zero when $n$ tends to infinity.
Proof. First of all, recall that for all $n$,

$$G^n_{Q_0} \ge G^n_{Q_0+1} \ge \dots \ge G^n_{n-1} \ge 0$$

Therefore we can just prove that $n^\beta G^n_{Q_0} \to 0$ a.s. when $n \to +\infty$, and the uniform
convergence will follow.

On the event $B_n$, the $Q_0 - 1$ largest intervals lie between normalized degrees
from different classes. The next intervals lie between degrees from the same
class, and the distance to their corresponding conditional mean is at most $d_n$.
As $G_{Q_0}$ is one of these, $G_{Q_0} \le 2d_n$. Hence, for all $t > 0$:

$$P(n^\beta G_{Q_0} > t) \le P(2n^\beta d_n > t) + P(\bar{B}_n)$$

$$\le 2\left(e^{-\frac{1}{2} n^{1-2\beta} t^2} + e^{-\frac{2}{25} n \delta^2}\right) + Q_0 (1 - \alpha_0)^n$$

□ 

6.2 Study of the intervals between estimated classes 

By distances between estimated classes, we mean distances between empirical
averages of the normalized degrees of each class, provided by the LG algorithm.
Define $m_q$ to be the average of the normalized degrees of the $q$-labeled class
estimated by the algorithm.

The sequence of the gaps between consecutive averages $(m_{(q+1)} - m_{(q)})_{q \in [Q-1]}$
is sorted in order of decreasing length, just as the sequence of the gaps
$(T_{(i+1)} - T_{(i)})_{i \in [n-1]}$ is in the previous paragraph. This new sequence is denoted by
$(H_q)_{q \in [Q-1]}$. Of course it depends on the current $Q$, whereas $(G_q)_q$ does not.

When $Q = Q_0$, $H_q$ and $G_q$ are very close for all $q \le Q_0 - 1$. On the contrary,
when $Q < Q_0$, some of the $(H_q)_{q \in [Q-1]}$ stretch over several classes and include
more than one of the $G_q$. As a result, there is at least one $q$ such that $H_q$
asymptotically differs from $G_q$.
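
The following sketch illustrates how the sequence of gaps between class averages can be computed. It assumes that the LG algorithm cuts the sorted normalized degrees at the Q - 1 largest gaps; the function names `lg_groups` and `mean_gaps` and the data are hypothetical and serve only as an illustration.

```python
def lg_groups(normalized_degrees, n_classes):
    """Split the sorted values at the (n_classes - 1) largest gaps.

    Assumption: this is how the LG algorithm builds its groups.
    """
    t = sorted(normalized_degrees)
    gaps = [(t[i + 1] - t[i], i) for i in range(len(t) - 1)]
    # Positions i such that a cut is placed between t[i] and t[i + 1].
    cuts = sorted(i for _, i in sorted(gaps, reverse=True)[: n_classes - 1])
    groups, start = [], 0
    for i in cuts:
        groups.append(t[start : i + 1])
        start = i + 1
    groups.append(t[start:])
    return groups


def mean_gaps(groups):
    """Gaps between consecutive group averages, largest first (the H_q)."""
    means = [sum(g) / len(g) for g in groups]
    gaps = [means[k + 1] - means[k] for k in range(len(means) - 1)]
    return sorted(gaps, reverse=True)


# Hypothetical normalized degrees with three visible clusters.
degrees = [0.10, 0.11, 0.12, 0.50, 0.52, 0.88, 0.90]

h_right = mean_gaps(lg_groups(degrees, 3))  # Q equal to the true number
h_small = mean_gaps(lg_groups(degrees, 2))  # Q too small: two clusters merge
print(h_right, h_small)
```

With three groups, the gaps between averages stay close to the two largest gaps between consecutive degrees. With two groups, the single gap between averages stretches over a merged cluster and is clearly larger than the largest gap between consecutive degrees.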
Theorem 6.4.

1. If $Q = Q_0$, then $\sum_{q=1}^{Q-1} (H_q - G_q) \to 0$ a.s. when $n \to +\infty$.

2. If $Q < Q_0$, then $\liminf_{n \to +\infty} \sum_{q=1}^{Q-1} (H_q - G_q) > 0$ a.s.



Proof. Let $(J_q)_{q \in [Q_0-1]}$ be the $Q_0 - 1$ largest intervals between consecutive
normalized degrees, hence for all $q$, $|J_q| = G_q$. Define also $J_0 = [0, \min_{i \in [n]} T_i[$ and
$J_{Q_0} = [\max_{i \in [n]} T_i, 1[$. The union of $J_0, J_1, \dots, J_{Q_0-1}, J_{Q_0}$ partially covers the
interval $[0, 1[$. These intervals are separated and the distance between the bounds
of consecutive intervals is at most $2d_n$. As a result:

$$1 - 2Q_0 d_n \le \sum_{q=1}^{Q_0-1} G_q + H_0 + H_Q \le 1 = \sum_{q=0}^{Q} H_q$$

$Q = Q_0$: Subtracting the right-hand side (which actually equals 1), we deduce
from both previous inequalities that:

$$-2Q_0 d_n \le \sum_{q=1}^{Q_0-1} (G_q - H_q) \le 0$$



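The first assertion follows directly from this inequality; for all $t > 0$:

$$P\left[\left|\sum_{q=1}^{Q_0-1} (H_q - G_q)\right| > t\right] \le P(2Q_0 d_n > t) \le 2\exp\left[-2n\left(\frac{t}{2Q_0}\right)^2\right] = 2\exp\left[-\frac{n}{2}\left(\frac{t}{Q_0}\right)^2\right]$$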



$Q < Q_0$: Subtracting the right-hand side from the second inequality only yields
this time:

$$\sum_{q=Q}^{Q_0-1} G_q \le \sum_{q=1}^{Q-1} (H_q - G_q)$$

But as shown in Theorem 6.2, the lower limit of $G_q$ is positive for all
$q \le Q_0 - 1$. A fortiori, the second assertion of Theorem 6.4 holds as
well.



□ 



6.3 Application to model selection 

The summed differences $\sum_{q=1}^{Q-1} (H_q - G_q)$ examined in the last paragraph have
an interesting property regarding model selection: when $Q$ is the right number
of classes, the sum converges to zero, and when $Q$ is too small, its lower limit is
positive, because one of the $H_q$ does not match $G_q$. Thus this quantity
measures the risk of underestimating the number of classes.

However, its minimization over all $Q \in \{2, \dots, n\}$ yields the unexpected
solution $Q = n$, for all $Q_0$. Therefore we have to penalize overly small gaps
between normalized degrees. We chose to use an ad hoc penalty, that can be
easily inferred from our previous study, in order to have a correct estimate of
$Q_0$. Define for all $Q \in \{2, \dots, n\}$:

$$f_Q = \sum_{q=1}^{Q-1} (H_q - G_q) + \frac{1}{n^\beta G_{Q-1}} \in [0, +\infty], \quad \text{where } \beta \in ]0, 1[.$$

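Theorem 6.5.

1. If $Q = Q_0$, then $f_Q \to 0$ a.s. when $n \to +\infty$.

2. If $Q < Q_0$, then $\liminf_{n \to +\infty} f_Q > 0$ a.s.

3. If $Q > Q_0$, then $f_Q \to +\infty$ a.s. when $n \to +\infty$.

It follows that $\hat{Q} = \arg\min_{2 \le Q \le n} f_Q \to Q_0$ a.s. when $n \to +\infty$.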



Proof. If $Q = Q_0$: Applying Theorem 6.4, the sum $\sum_{q=1}^{Q-1} (H_q - G_q)$ converges
a.s. to 0. According to Theorem 6.2, $\liminf_{n \to +\infty} G_{Q_0-1} > 0$ almost surely.
Therefore:

$$\frac{1}{n^\beta G_{Q_0-1}} \to 0 \quad \text{a.s. when } n \to +\infty$$

If $Q < Q_0$: According to the second assertion of Theorem 6.4, the lower limit of
the first term is positive. Adding the second term does not change this,
because it is positive. Hence:

$$\liminf_{n \to +\infty} f_Q > 0$$

If $Q > Q_0$: The sum $\sum_{q=1}^{Q-1} (H_q - G_q)$ is lower bounded by $-1$ (notice that it is
even positive), and according to Theorem 6.3,
$(n^\beta G_{Q-1})_n$ converges to 0 as soon as $Q > Q_0$. The last
assertion follows.

□ 
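
The selection rule can be sketched as follows. This is only an illustration under stated assumptions: the LG algorithm is taken to cut the sorted normalized degrees at the Q - 1 largest gaps, the penalty is read as 1 / (n^beta * G_{Q-1}), the value beta = 0.25 is an arbitrary choice, and the data are synthetic. The names `lg_groups`, `criterion` and `select_n_classes` are hypothetical.

```python
import math


def lg_groups(values, n_classes):
    """Assumed LG step: cut the sorted values at the (n_classes - 1) largest gaps."""
    t = sorted(values)
    gaps = [(t[i + 1] - t[i], i) for i in range(len(t) - 1)]
    cuts = sorted(i for _, i in sorted(gaps, reverse=True)[: n_classes - 1])
    groups, start = [], 0
    for i in cuts:
        groups.append(t[start : i + 1])
        start = i + 1
    groups.append(t[start:])
    return groups


def criterion(values, n_classes, beta=0.25):
    """Penalized criterion f_Q: sum of (H_q - G_q) plus 1 / (n^beta * G_{Q-1})."""
    n = len(values)
    t = sorted(values)
    big_g = sorted((t[i + 1] - t[i] for i in range(n - 1)), reverse=True)
    means = [sum(g) / len(g) for g in lg_groups(values, n_classes)]
    big_h = sorted(
        (means[k + 1] - means[k] for k in range(len(means) - 1)), reverse=True
    )
    # zip stops after the Q - 1 terms of H, so only G_1..G_{Q-1} are used.
    fit = sum(h - g for h, g in zip(big_h, big_g))
    last_gap = big_g[n_classes - 2]  # G_{Q-1}, the smallest gap used as a cut
    penalty = math.inf if last_gap == 0 else 1.0 / (n**beta * last_gap)
    return fit + penalty


def select_n_classes(values, beta=0.25):
    """Return the Q in {2, ..., n} minimizing the criterion."""
    return min(range(2, len(values) + 1), key=lambda q: criterion(values, q, beta))


# Synthetic normalized degrees: three tight clusters of ten nodes each.
values = [round(c + 0.001 * k, 3) for c in (0.1, 0.5, 0.9) for k in range(10)]
print(select_n_classes(values))
```

On this synthetic input, choosing Q too large forces a cut inside a cluster, so the last gap is tiny and the penalty becomes very large, whereas choosing Q too small inflates the first term.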



7 Conclusions 

Unlike most of the methods known thus far, the LG algorithm is able to process 
very large graphs. In fact it provides good results only for such graphs. However, 
in practice, the algorithm is efficient even for smaller graphs than theoretically 
expected. Moreover it is self-sufficient: it provides consistent methods for node 
clustering, parameter estimation and model selection. Lastly, this algorithm is 
free from any preliminary setting. Consequently there is need neither for any
prior knowledge nor for multiple runs of the algorithm. Thus it can quickly
provide good initialization values for other algorithms which depend heavily on
them.

Above all, the LG algorithm performs every task using the degree data alone.
In conclusion, the degree data asymptotically includes the information needed
to achieve all of the statistical inference in this model.



A Concentration inequality for products of binomial distributed variables

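Proposition A.0.1. Let $X$ (respectively $Y$) be a sum of $n$ independent Bernoulli
distributed variables with parameter $p$ (respectively $q$). Then for all $t > 0$:

$$P\left(\left|\frac{XY}{n^2} - pq\right| > t\right) \le 4\exp\left(-\frac{1}{2} n t^2\right)$$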
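Proof.

$$\begin{aligned}
P\left(\left|\frac{XY}{n^2} - pq\right| > t\right)
&= P\left(\left|\frac{X}{n}\left(\frac{Y}{n} - q\right) + q\left(\frac{X}{n} - p\right)\right| > t\right) \\
&\le P\left(\left|\frac{Y}{n} - q\right| + \left|\frac{X}{n} - p\right| > t\right) \\
&\le P\left(\left|\frac{X}{n} - p\right| > \frac{t}{2}\right) + P\left(\left|\frac{Y}{n} - q\right| > \frac{t}{2}\right) \\
&\le 2 \times 2\exp\left[-2n\left(\frac{t}{2}\right)^2\right] = 4\exp\left(-\frac{1}{2} n t^2\right)
\end{aligned}$$

The last line is obtained by applying the usual concentration inequality ( CCT )
to both $X$ and $Y$. □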
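
A small Monte Carlo experiment can make the inequality tangible. The sketch below assumes the bound reads 4 exp(-n t^2 / 2), which is what Hoeffding's inequality combined with a union bound gives. The function names, the parameters n = 200, p = q = 0.5, t = 0.2 and the seed are arbitrary choices made for the illustration.

```python
import math
import random


def deviation_frequency(n, p, q, t, n_trials=2000, seed=0):
    """Empirical frequency of the event |XY / n^2 - pq| > t.

    X and Y are simulated as sums of n Bernoulli variables with
    parameters p and q respectively.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        x = sum(rng.random() < p for _ in range(n))
        y = sum(rng.random() < q for _ in range(n))
        if abs(x * y / n**2 - p * q) > t:
            hits += 1
    return hits / n_trials


def product_bound(n, t):
    """Assumed form of the upper bound: 4 * exp(-n * t^2 / 2)."""
    return 4.0 * math.exp(-0.5 * n * t * t)


frequency = deviation_frequency(n=200, p=0.5, q=0.5, t=0.2)
print(frequency, product_bound(200, 0.2))
```

The simulation is a sanity check rather than a proof: it only shows that the observed frequency stays below the bound for one parameter setting.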

With a similar proof, we prove that for all $t \in ]0, 1/4]$:

:{x - 1) 

2rP- 2 



P 



>t \ < 4cxp (-2r7,r 



B Separation of mixed classes 

Suppose that there are $Q$ classes and $\bar\pi_q = \bar\pi_r$ for some $q$ and $r$. For the sake
of simplicity, all other conditional averages are assumed to be pairwise distinct.
The LG algorithm is supposed to be previously applied to the graph with the
input parameter $Q - 1$. Let us point out that the $Q - 1$ groups returned by
LG are asymptotically the true classes, except classes $q$ and $r$, which are mixed
together in one group of nodes, denoted by $M$. We shall briefly explain a
procedure to separate this group, using the concentration of some additional
binomial variables, namely the number of common neighbors of each pair of
nodes.



Notation. Define $\alpha$ the diagonal matrix whose diagonal coefficients are
$(\alpha_q)_{q \in [Q]}$ and the bilinear map on $\mathbb{R}^Q$:

$$\langle x, y \rangle_\alpha = \sum_{l=1}^{Q} \alpha_l x_l y_l$$

which is a scalar product as soon as $\alpha_q$ is positive for all $q$. $\|\cdot\|_\alpha$ denotes
the associated norm.

For all pairs of nodes $(i, j) \in M \times M$, define

$$D_{ij} = \sum_{k} Y_{ijk}, \quad \text{where } Y_{ijk} = X_{ik} X_{jk}.$$

$Y_{ijk}$ is a Bernoulli distributed variable that equals one if and only if $i$ and $j$
are both connected to $k$. Its parameter depends on the classes of
nodes $i$ and $j$:

• If $i$ and $j$ both belong to the $q$-labeled class:

$$P(Y_{ijk} = 1 \mid Z_i = Z_j = q) = \sum_{l=1}^{Q} \alpha_l \pi_{ql}^2 = \|\pi_q\|_\alpha^2$$

where $\pi_q$ is the row vector $(\pi_{ql})_l$. Symmetrically, if they both belong to
the $r$-labeled class, the parameter is $\|\pi_r\|_\alpha^2$.

• Otherwise, if they belong to distinct classes $q \ne r$:

$$P(Y_{ijk} = 1 \mid Z_i = q, Z_j = r) = \sum_{l=1}^{Q} \alpha_l \pi_{ql} \pi_{rl} = \langle \pi_q, \pi_r \rangle_\alpha$$

The behavior of the new variables $D_{ij}$ looks like that of the degrees; they
once more quickly concentrate around their average value as a consequence of
the concentration of binomial variables. There are three groups of node pairs,
concentrating around $\|\pi_q\|_\alpha^2$, $\|\pi_r\|_\alpha^2$, or $\langle \pi_q, \pi_r \rangle_\alpha$.
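
Counting common neighbors is straightforward once the adjacency matrix is available. The sketch below is a hedged illustration: the function name `common_neighbor_counts`, the choice to skip k equal to i or j in the sum, and the small graph are assumptions made for the example.

```python
def common_neighbor_counts(adjacency, nodes):
    """Number of common neighbors D_ij for every pair of nodes in `nodes`.

    `adjacency` is a symmetric 0/1 matrix given as a list of lists.
    Assumption: the sum over k skips k = i and k = j.
    """
    n = len(adjacency)
    counts = {}
    for a, i in enumerate(nodes):
        for j in nodes[a + 1 :]:
            counts[(i, j)] = sum(
                adjacency[i][k] * adjacency[j][k]
                for k in range(n)
                if k not in (i, j)
            )
    return counts


# Hypothetical graph on 5 nodes; the mixed group M is assumed to be {0, 1, 4}.
adjacency = [
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
]
mixed_group = [0, 1, 4]
print(common_neighbor_counts(adjacency, mixed_group))
```

Dividing each count by the number of candidate neighbors would give normalized values, which can then be handled like normalized degrees.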

Suppose that $\|\pi_q\|_\alpha \le \|\pi_r\|_\alpha$. Applying the Cauchy-Schwarz inequality,

$$0 \le \langle \pi_q, \pi_r \rangle_\alpha \le \|\pi_q\|_\alpha \|\pi_r\|_\alpha \le \|\pi_r\|_\alpha^2$$

The case of equality in the Cauchy-Schwarz inequality cannot arise; if it did,
then $\pi_q$ and $\pi_r$ would be collinear vectors. Denoting by $c$ the constant of collinearity,
we would get $\bar\pi_q = c\,\bar\pi_r$. But $\bar\pi_q$ and $\bar\pi_r$ are assumed to be equal in this section;
hence $c = 1$, and $\pi_q$ and $\pi_r$ would be equal. This is not allowed by the model for
identifiability reasons. The inequality is finally strict, which especially implies:

$$0 \le \langle \pi_q, \pi_r \rangle_\alpha < \|\pi_r\|_\alpha^2$$

The furthest group to the right on the real line consequently contains only
pairs of nodes of the same membership, which is sufficient to solve the mixing
problem. We just have to extract this group from the other two by using the
LG algorithm with $Q = 2$ as input parameter. Define $W$ as the set of the pairs
which are in this group, and $F$ as the set of nodes which are involved in those
pairs. Let $K$ be the graph defined by $(F, W)$. There are three cases:

• If $\langle \pi_q, \pi_r \rangle_\alpha < \|\pi_q\|_\alpha^2 < \|\pi_r\|_\alpha^2$ and $\|\pi_q\|_\alpha^2 - \langle \pi_q, \pi_r \rangle_\alpha < \|\pi_r\|_\alpha^2 - \|\pi_q\|_\alpha^2$, $K$
asymptotically forms one clique composed of all nodes from the $r$-labeled
class. Hence we deduce that remaining nodes are from the $q$-labeled class.

• If $\langle \pi_q, \pi_r \rangle_\alpha < \|\pi_q\|_\alpha^2 < \|\pi_r\|_\alpha^2$ and $\|\pi_q\|_\alpha^2 - \langle \pi_q, \pi_r \rangle_\alpha > \|\pi_r\|_\alpha^2 - \|\pi_q\|_\alpha^2$, then
the graph $K$ has asymptotically two cliques: one formed by the nodes of
class $q$ and the other one by the nodes of class $r$. If the equality holds in
the second inequality, there is either one clique as in the first case or two,
depending on the selected gap.

• If $\|\pi_q\|_\alpha^2 < \langle \pi_q, \pi_r \rangle_\alpha < \|\pi_r\|_\alpha^2$, the gap between $\|\pi_q\|_\alpha^2$ and $\langle \pi_q, \pi_r \rangle_\alpha$ is
necessarily strictly shorter than the one between $\langle \pi_q, \pi_r \rangle_\alpha$ and $\|\pi_r\|_\alpha^2$.
Indeed this amounts to saying that $\|\pi_q - \pi_r\|_\alpha^2 > 0$. Thus $K$ asymptotically
forms one clique again.
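
The case analysis above can be sketched in a few lines. The code compares the three concentration points, read here as the two squared weighted norms and the weighted inner product, and reports which structure the graph K is expected to have. The function names, the returned labels and the numerical vectors are hypothetical and only meant as an illustration.

```python
def alpha_inner(x, y, alpha):
    """Weighted inner product: sum over l of alpha_l * x_l * y_l."""
    return sum(a * u * v for a, u, v in zip(alpha, x, y))


def clique_structure(pi_q, pi_r, alpha):
    """Expected asymptotic structure of K for two mixed classes."""
    cross = alpha_inner(pi_q, pi_r, alpha)
    small, large = sorted(
        (alpha_inner(pi_q, pi_q, alpha), alpha_inner(pi_r, pi_r, alpha))
    )
    if cross < small:
        # The inner product is the leftmost point: compare the two gaps.
        lower_gap, upper_gap = small - cross, large - small
        if lower_gap < upper_gap:
            return "one clique"
        if lower_gap > upper_gap:
            return "two cliques"
        return "one or two cliques"
    # The inner product lies between the two squared norms.
    return "one clique"


# Hypothetical parameters with equal conditional averages for both classes:
# 0.25*0.2 + 0.25*0.4 + 0.5*0.5 == 0.25*0.4 + 0.25*0.6 + 0.5*0.3 == 0.4
alpha = (0.25, 0.25, 0.5)
pi_q = (0.2, 0.4, 0.5)
pi_r = (0.4, 0.6, 0.3)
print(clique_structure(pi_q, pi_r, alpha))
```

In this example the two squared norms nearly coincide while the inner product is smaller, so the largest gap separates the mixed pairs from all same-class pairs.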



C Proof of Theorem 6.1 



Let us define $(J_i)_{i \in [n-1]}$ the sequence of the intervals $[T_{(i)}, T_{(i+1)}[$ sorted in order
of decreasing length, hence for all $i \in [n-1]$, $|J_i| = G_i$. We suppose hereafter that
the sequence $(\bar\pi_q)_q$ is sorted in increasing order: $\bar\pi_1 < \dots < \bar\pi_{Q_0}$.

Proof. On the event $B_n$, among the $Q_0 - 1$ largest intervals, we can associate
with each $\bar\pi_q$ the only one lying between $\bar\pi_q$ and $\bar\pi_{q+1}$. Namely the only $J_i$ with
$i \in [Q_0 - 1]$ such that $J_i \cap ]\bar\pi_q, \bar\pi_{q+1}[ \ne \emptyset$. $S(q)$ denotes the index in $[Q_0 - 1]$
corresponding to this unique interval.

Moreover, $s(q)$ denotes one of the indexes $s \in [Q_0 - 1]$ such that $\gamma_s = \bar\pi_{q+1} - \bar\pi_q$,
chosen so that $s$ is injective. Let us point out that $S$ is a random permutation
whereas $s$ is deterministic. In order to simplify notations, we silently make the
deterministic index change $r = s(q)$. Thereby $(\gamma_q)_q$ still denotes the sequence
$(\gamma_{s(q)})_q$ and $S$ the permutation $S \circ s^{-1}$.



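Notice that on $B_n$, and especially when $d_n < \frac{\delta}{5}$:

$$[\bar\pi_q + d_n, \bar\pi_{q+1} - d_n] \subset J_{S(q)} \subset [\bar\pi_q - d_n, \bar\pi_{q+1} + d_n]$$

Hence $|G_{S(q)} - \gamma_q| \le 2d_n$. (2)

1. We first prove that the gap $G_{S(q)}$ converges to the theoretical gap $\gamma_q$. For
all $t > 0$:

$$P(|G_{S(q)} - \gamma_q| > t) = P(\{|G_{S(q)} - \gamma_q| > t\} \cap B_n) + P(\{|G_{S(q)} - \gamma_q| > t\} \cap \bar{B}_n)$$

$$\le P(2d_n > t) + P(\bar{B}_n)$$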

$$\le 2\left(e^{-\frac{1}{2} n t^2} + e^{-\frac{2}{25} n \delta^2}\right) + Q_0 (1 - \alpha_0)^n \quad (3)$$



2. Secondly, none of the $Q_0 - 1$ largest intervals permute anymore, except for
those having the same theoretical values. It follows from the inequality (2) that
for all $q, r \in [Q_0 - 1]$,

$$\gamma_q - \gamma_r - 4d_n \le G_{S(q)} - G_{S(r)} \le \gamma_q - \gamma_r + 4d_n$$

Define $\eta = \frac{1}{5}\left(\min_{q:\, \gamma_q > \gamma_{q+1}} (\gamma_q - \gamma_{q+1}) \wedge \delta\right)$, a threshold designed to distinguish
distances converging to one value from those converging to another. On the
event $d_n < \eta$, the previous inequality yields:

$$\gamma_q - \gamma_r - 4\eta < G_{S(q)} - G_{S(r)} < \gamma_q - \gamma_r + 4\eta$$

• If $\gamma_q - \gamma_r < 0$, then $\gamma_q - \gamma_r + 4\eta < 0$ is also true by the definition of $\eta$. As
a result of the inequality just above, $G_{S(q)} - G_{S(r)} < 0$.

• If $\gamma_q - \gamma_r > 0$, then $\gamma_q - \gamma_r - 4\eta > 0$, and $G_{S(q)} - G_{S(r)} > 0$.

If $(u_i)_{1 \le i \le m}$ is a sequence, we write $i \sim_u j$ if and only if $u_i = u_j$. $\sim_u$ is an
equivalence relation. Applying the Lemma C.1 stated and proved afterwards,
if $d_n < \eta$, there exists $r \sim_\gamma q$ such that $q = S(r)$. Notice furthermore that
the sequence $(\gamma_q)_{q \in [Q_0-1]}$ is constant on the $\sim_\gamma$-equivalence classes. The term
$|G_q - \gamma_q|$ is necessarily in the sum $\sum_{r \sim_\gamma q} |G_{S(r)} - \gamma_r|$. Finally, define



Lemma C.1. Let $(u_i)_{1 \le i \le m}$, $(v_i)_{1 \le i \le m}$ be two real decreasing sequences. Let $p$
be the number of $\sim_u$-equivalence classes and $\sigma$ one permutation of $\{1, \dots, m\}$.
We especially assume that for all $i, j \in \{1, \dots, m\}$,

• $u_i < u_j \Rightarrow v_{\sigma(i)} < v_{\sigma(j)}$

• $u_i > u_j \Rightarrow v_{\sigma(i)} > v_{\sigma(j)}$

Then $\sigma = \sigma_1 \circ \dots \circ \sigma_p$ where the support of $\sigma_i$ is the $i$-th $\sim_u$-equivalence class.

Proof. Since $u$ is decreasing, the $\sim_u$-equivalence classes are just sets of
consecutive natural integers. Define recursively $(r_i)_{1 \le i \le p}$ the increasing sequence of
indexes $j$ when the value of $u_j$ changes:

• Let $r_1 = 1$.

• For $i \ge 1$, let $r_{i+1}$ be the smallest integer $j > r_i$ such that $u_{r_i} = \dots = u_{j-1} > u_j$.

The construction of $(r_i)_i$ implies that for all $j < r_i$, all $r_i \le l < r_{i+1}$ and all
$k \ge r_{i+1}$: $u_j > u_l > u_k$, and furthermore $v_{\sigma(j)} > v_{\sigma(l)} > v_{\sigma(k)}$ as well. As $v$
decreases, $\sigma(\{r_i, \dots, r_{i+1} - 1\}) = \{r_i, \dots, r_{i+1} - 1\}$. The result follows directly
from this.
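
The statement of the lemma can be checked exhaustively on a small case. The sketch below uses 0-based indexes, two hypothetical decreasing sequences `u` and `v`, and helper names that are not part of the paper. It enumerates all permutations satisfying the two hypotheses and verifies that each one maps every equivalence class of `u` onto itself.

```python
from itertools import permutations


def equivalence_blocks(u):
    """Blocks of consecutive indexes on which the decreasing sequence u is constant."""
    blocks, start = [], 0
    for j in range(1, len(u) + 1):
        if j == len(u) or u[j] != u[start]:
            blocks.append(list(range(start, j)))
            start = j
    return blocks


def satisfies_hypotheses(sigma, u, v):
    """Check both order-preservation hypotheses of the lemma for sigma."""
    m = len(u)
    for i in range(m):
        for j in range(m):
            if u[i] < u[j] and not v[sigma[i]] < v[sigma[j]]:
                return False
            if u[i] > u[j] and not v[sigma[i]] > v[sigma[j]]:
                return False
    return True


def preserves_blocks(sigma, u):
    """True if sigma maps every equivalence class of u onto itself."""
    return all(
        sorted(sigma[i] for i in block) == block for block in equivalence_blocks(u)
    )


# Hypothetical decreasing sequences; u has three equivalence classes.
u = [5, 5, 3, 1, 1]
v = [9, 8, 4, 2, 1]
admissible = [
    s for s in permutations(range(len(u))) if satisfies_hypotheses(s, u, v)
]
print(len(admissible), all(preserves_blocks(s, u) for s in admissible))
```

Such an enumeration does not replace the proof, but it shows concretely that the hypotheses only leave room for permutations acting inside each equivalence class.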



$$P(|G_q - \gamma_q| > t) = P(\{|G_q - \gamma_q| > t\} \cap B_n) + P(\{|G_q - \gamma_q| > t\} \cap \bar{B}_n)$$




according to 



□ 



□ 